Abstract

Stochastic Dual Coordinate Descent (DCD) has become one of the most efficient ways to solve the family of $\ell_2$-regularized empirical risk minimization problems, including linear SVM, logistic regression, and many others. The vanilla implementation of DCD is quite slow; however, by maintaining primal variables while updating dual variables, the time complexity of DCD can be significantly reduced. Such a strategy forms the core algorithm in the widely-used LIBLINEAR package. In this paper, we parallelize the DCD algorithms in LIBLINEAR. Several synchronized parallel DCD algorithms have been proposed in recent research; however, they fail to achieve good speedup in the shared memory multi-core setting. We propose a family of asynchronous stochastic dual coordinate descent algorithms (PASSCoDe). Each thread repeatedly selects a random dual variable and conducts coordinate updates using the primal variables that are stored in the shared memory. We analyze the convergence properties when different locking/atomic mechanisms are applied. For the implementation with atomic operations, we show linear convergence under mild conditions. For the implementation without any atomic operations or locking, we present the first backward error analysis for PASSCoDe under the multi-core environment, showing that the converged solution is the exact solution for a primal problem with a perturbed regularizer. Experimental results show that our methods are much faster than previous parallel coordinate descent solvers.


1 Introduction 


Given a set of instance-label pairs $(\dot{x}_i, \dot{y}_i)$, $i = 1, \dots, n$, $\dot{x}_i \in \mathbb{R}^d$, $\dot{y}_i \in \mathbb{R}$, we focus on the following empirical risk minimization problem with $\ell_2$-regularization:

$$\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \ell_i(w^T x_i), \qquad (1)$$

where $x_i = \dot{y}_i \dot{x}_i$, $\ell_i(\cdot)$ is the loss function and $\|\cdot\|$ is the 2-norm. A large class of machine learning problems can be formulated as the above optimization problem. Examples include Support Vector Machines (SVMs), logistic regression, ridge regression, and many others. Problem (1) is usually called the primal problem, and can usually be solved by Stochastic Gradient Descent (SGD) (Zhang, 2004; Shalev-Shwartz et al., 2007), second order methods (Lin et al., 2007), or primal coordinate descent algorithms (Chang et al., 2008; Huang et al., 2009).


Instead of solving the primal problem, another class of algorithms solves the following dual problem of (1):

$$\min_{\alpha \in \mathbb{R}^n} D(\alpha) := \frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i x_i\Big\|^2 + \sum_{i=1}^n \ell_i^*(-\alpha_i), \qquad (2)$$

where $\ell_i^*(\cdot)$ is the conjugate of the loss function defined by $\ell_i^*(u) = \max_z (zu - \ell_i(z))$. If we define

$$w(\alpha) = \sum_{i=1}^n \alpha_i x_i, \qquad (3)$$

then it is known that $w(\alpha^*) = w^*$ and $P(w^*) = -D(\alpha^*)$, where $w^*, \alpha^*$ are the optimal primal/dual solutions respectively. Examples include hinge-loss SVM, square hinge SVM and $\ell_2$-regularized logistic regression.

Stochastic Dual Coordinate Descent (DCD) has become the most widely-used algorithm for solving (2), and it is faster than primal solvers (including SGD) in many large-scale problems. The success of DCD is mainly due to the trick of maintaining the primal variables $w$ based on the primal-dual relationship (3). By maintaining $w$ in memory, Hsieh et al. (2008) and Keerthi et al. (2008) showed that the time complexity of each coordinate update can be reduced from $O(\mathrm{nnz})$ to $O(\mathrm{nnz}/n)$, where nnz is the number of nonzeros in the training dataset. Several DCD algorithms for different machine learning problems are currently implemented in LIBLINEAR (Fan et al., 2008) and they are now widely used in both academia and industry. The success of DCD has also catalyzed a large body of theoretical studies (Nesterov, 2012; Shalev-Shwartz & Zhang, 2013).


In this paper, we parallelize the DCD algorithm in a shared memory multi-core system. There are two threads of work on parallel coordinate descent. The first thread focuses on synchronized algorithms, including synchronized CD (Richtárik & Takáč, 2012; Bradley et al., 2011) and synchronized DCD algorithms (Yang, 2013; Jaggi et al., 2014). However, choosing the block size is a trade-off between communication and convergence speed, so synchronous algorithms usually suffer from slower convergence. To overcome this problem, the other thread of work focuses on asynchronous CD algorithms in multi-core shared memory systems (Liu & Wright, 2014; Liu et al., 2014). However, none of the existing work maintains both the primal and dual variables. As a result, the recent asynchronous CD algorithms end up being much slower than the state-of-the-art serial DCD algorithms that maintain both $w$ and $\alpha$, as in the LIBLINEAR software. This leads to a challenging question: how can both primal and dual variables be maintained in an asynchronous and efficient way?

In this paper, we propose the first asynchronous dual coordinate descent (PASSCoDe) algorithms that address the issue of primal variable maintenance in the shared memory multi-core setting. We carefully discuss and analyze three versions of PASSCoDe: PASSCoDe-Lock, PASSCoDe-Atomic, and PASSCoDe-Wild. In PASSCoDe-Lock, convergence is always guaranteed but the overhead for locking makes it even slower than serial DCD. In PASSCoDe-Atomic, the primal-dual relationship (3) is enforced by atomic writes to the shared memory, while PASSCoDe-Wild proceeds without any locking or atomic operations, as a result of which the relationship (3) between primal and dual variables can be violated due to memory conflicts. Our contributions can be summarized as follows:

• We propose and analyze a family of asynchronous parallelizations of the most efficient DCD algorithm: PASSCoDe-Lock, PASSCoDe-Atomic, PASSCoDe-Wild.

• We show linear convergence of PASSCoDe-Atomic under certain conditions.

• We present a backward error analysis for PASSCoDe-Wild and show that the converged solution is the exact solution of a primal problem with a perturbed regularizer. Therefore the performance is close-to-optimal on most of the datasets. To the best of our knowledge, this is the first attempt to analyze a parallel machine learning algorithm with memory conflicts using backward error analysis, which is a standard tool in numerical analysis (Wilkinson, 1961).

• Experimental results show that our algorithms (PASSCoDe-Atomic and PASSCoDe-Wild) are much faster than existing methods. For example, on the webspam dataset, PASSCoDe-Atomic took 2 seconds and PASSCoDe-Wild took 1.6 seconds to achieve 99% accuracy, while CoCoA took 11.5 seconds using 10 threads and LIBLINEAR took 10 seconds using 1 thread to achieve the same accuracy.


2 Related Work 


Stochastic Coordinate Descent. Coordinate descent is a classical optimization technique that has been studied for a long time (Bertsekas, 1999; Luo & Tseng, 1992). Recently it has enjoyed renewed interest due to the success of “stochastic” coordinate descent in real applications (Hsieh et al., 2008; Nesterov, 2012). In terms of theoretical analysis, the convergence of (cyclic) coordinate descent has been studied for a long time (Luo & Tseng, 1992; Bertsekas, 1999), and global linear convergence was shown recently under certain conditions (Saha & Tewari, 2013; Wang & Lin, 2014).

Stochastic Dual Coordinate Descent. Many recent papers (Hsieh et al., 2008; Yu et al., 2011; Shalev-Shwartz & Zhang, 2013) have shown that solving the dual problem using coordinate descent algorithms is faster on large-scale datasets. The success of DCD strongly relies on exploiting the primal-dual relationship (3) to speed up the gradient computation in the dual space. DCD has become the state-of-the-art solver implemented in LIBLINEAR (Fan et al., 2008). In terms of convergence of the dual objective function, some standard theoretical guarantees for coordinate descent can be directly applied. Different from the standard analysis, Shalev-Shwartz & Zhang (2013) presented the convergence rate in terms of the duality gap.


Parallel Stochastic Coordinate Descent. In order to conduct coordinate updates in parallel, Richtárik & Takáč (2012) studied the algorithm where each processor updates a randomly selected block (or coordinate) simultaneously, and Bradley et al. (2011) proposed a similar algorithm for $\ell_1$-regularized problems. Scherrer et al. (2012) studied parallel greedy coordinate descent. However, the above synchronized methods usually face a trade-off in choosing the block size. If the block size is small, the load balancing problem leads to slow running time. If the block size is large, the convergence speed becomes much slower or the algorithm even diverges. These problems can be resolved by developing an asynchronous algorithm. Asynchronous coordinate descent has been studied by Bertsekas & Tsitsiklis (1989), but they require the Hessian to be diagonally dominant in order to establish convergence. Recently, Liu et al. (2014) and Liu & Wright (2014) proved linear convergence of asynchronous stochastic coordinate descent algorithms under the essential strong convexity condition and a “bounded staleness” condition, where they consider both “consistent read” and “inconsistent read” models. Avron et al. (2014) showed a linear rate of convergence for asynchronous randomized Gauss-Seidel updates, which are a special case of coordinate descent on linear systems.

Parallel Stochastic Dual Coordinate Descent. For solving (2), each coordinate update only requires the global primal variables $w$ and one local dual variable $\alpha_i$, thus algorithms only need to synchronize $w$. Based on this observation, Yang (2013) proposed to update several coordinates or blocks simultaneously and then update the global $w$, and Jaggi et al. (2014) showed that each block can be solved with other approaches under the same framework. However, both of these parallel DCD methods are synchronized algorithms.

To the best of our knowledge, this is the first work to propose and analyze asynchronous parallel stochastic dual coordinate descent methods. By maintaining a primal solution $w$ while updating dual variables, our algorithm is much faster than the previous asynchronous coordinate descent methods of (Liu & Wright, 2014; Liu et al., 2014) for solving the dual problem (2). Our algorithms are also faster than synchronized dual coordinate descent methods (Yang, 2013; Jaggi et al., 2014) since the latest values of $w$ can be accessed by all the threads. In terms of theoretical contribution, the inconsistent read model in (Liu & Wright, 2014) cannot be directly applied to our algorithm because each update on $\alpha_i$ is based on the shared $w$ vector. We further show linear convergence for PASSCoDe-Atomic, and study the properties of the converged solution for the wild version of our algorithm (without any locking or atomic operations) using a backward error analysis. Our algorithm has been successfully applied to solve the collaborative ranking problem (Anonymous, 2015).

3 Algorithms 

3.1 Stochastic Dual Coordinate Descent 

We first describe the Stochastic Dual Coordinate Descent (DCD) algorithm for solving the dual problem (2). At each iteration, DCD randomly picks a dual variable $\alpha_i$ and updates it by minimizing the one-variable subproblem (Eq. (4) in Algorithm 1). Without exploiting the structure of the quadratic term, the subproblems require substantial computation ($O(\mathrm{nnz})$ time), where nnz is the total number of nonzero elements in the training data. However, if $w(\alpha)$ that satisfies (3) is maintained in memory, the subproblem $D(\alpha + \delta e_i)$ can be written as

$$D(\alpha + \delta e_i) = \frac{1}{2}\|w + \delta x_i\|^2 + \ell_i^*(-(\alpha_i + \delta)),$$

and the optimal solution can be computed by

$$\delta^* = \arg\min_\delta \frac{1}{2}\Big(\delta + \frac{w^T x_i}{\|x_i\|^2}\Big)^2 + \frac{1}{\|x_i\|^2}\,\ell_i^*(-(\alpha_i + \delta)).$$

Note that all $\|x_i\|$ can be pre-computed and are constants. For each coordinate update we only need to solve a simple one-variable subproblem, and the main computation is in computing $w^T x_i$, which requires $O(\mathrm{nnz}/n)$ time. For SVM problems, the subproblem has a closed form solution, while for logistic regression problems it has to be solved by an iterative solver (see Yu et al. (2012) for details). The DCD algorithm, which is part of the popular LIBLINEAR package, is described in Algorithm 1.


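Algorithm 1 Stochastic Dual Coordinate Descent (DCD)

Input: Initial $\alpha$ and $w = \sum_{i=1}^n \alpha_i x_i$

1: while not converged do
2:   Randomly pick $i$
3:   Update $\alpha_i \leftarrow \alpha_i + \Delta\alpha_i$, where
       $\Delta\alpha_i \leftarrow \arg\min_\delta \frac{1}{2}\|w + \delta x_i\|^2 + \ell_i^*(-(\alpha_i + \delta))$   (4)
4:   Update $w$ by $w \leftarrow w + \Delta\alpha_i x_i$
5: end while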
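The loop above can be sketched in a few lines for the hinge-loss case. This is a minimal illustration under stated assumptions, not the LIBLINEAR implementation: the function name `dcd_hinge`, the dict-based sparse rows, and the fixed number of passes (instead of a stopping test) are all choices made here for brevity. It uses the closed-form solution of the one-variable subproblem for hinge loss and keeps $w$ in memory so that each update touches only the nonzeros of $x_i$.

```python
import random

def dcd_hinge(X, C=1.0, epochs=20, seed=0):
    """Serial DCD sketch for the hinge-loss dual.

    X[i] is a dict {feature index: value} holding x_i = y_i * (raw features).
    """
    rng = random.Random(seed)
    n = len(X)
    alpha = [0.0] * n
    w = {}  # sparse primal vector, kept equal to sum_i alpha_i * x_i
    sq_norm = [sum(v * v for v in x.values()) for x in X]  # precomputed ||x_i||^2
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # one random permutation per pass
        for i in order:
            if sq_norm[i] == 0.0:
                continue  # skip all-zero instances
            # Coordinate gradient w^T x_i - 1, computed in O(nnz(x_i)) time.
            grad = sum(w.get(j, 0.0) * v for j, v in X[i].items()) - 1.0
            # Closed-form minimizer of the one-variable subproblem, clipped to [0, C].
            new_alpha = min(max(alpha[i] - grad / sq_norm[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                alpha[i] = new_alpha
                for j, v in X[i].items():  # maintain w, also O(nnz(x_i))
                    w[j] = w.get(j, 0.0) + delta * v
    return alpha, w
```

Because $w$ is updated together with every $\alpha_i$, the relationship $w = \sum_i \alpha_i x_i$ holds after every step of this serial sketch.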


3.2 Asynchronous Stochastic Dual Coordinate Descent 

To parallelize DCD in a shared memory multi-core system, we propose a family of Asynchronous Stochastic Dual Coordinate Descent (PASSCoDe) algorithms. PASSCoDe is very simple but effective. Each thread repeatedly runs the updates (steps 2 to 4) in Algorithm 1 using $w$, $\alpha$, and the training data stored in shared memory. The threads do not need to coordinate or synchronize their iterations. The details are shown in Algorithm 2.

















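Algorithm 2 Parallel Asynchronous Stochastic dual Co-ordinate Descent (PASSCoDe)

Input: Initial $\alpha$ and $w = \sum_{i=1}^n \alpha_i x_i$

Each thread repeatedly performs the following updates:
step 1: Randomly pick $i$
step 2: Update $\alpha_i \leftarrow \alpha_i + \Delta\alpha_i$, where
          $\Delta\alpha_i \leftarrow \arg\min_\delta \frac{1}{2}\|w + \delta x_i\|^2 + \ell_i^*(-(\alpha_i + \delta))$   (5)
step 3: Update $w$ by $w \leftarrow w + \Delta\alpha_i x_i$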


Table 1: Scaling of PASSCoDe algorithms. We present the run time (in seconds) for each algorithm on the rcv1 dataset with 100 iterations, and the speedup of each method over the serial DCD algorithm (2x means it is two times faster than the serial algorithm).

Number of threads    Lock               Atomic            Wild
2                    98.03s / 0.27x     15.28s / 1.75x    14.08s / 1.90x
4                    106.11s / 0.25x    8.35s / 3.20x     7.61s / 3.50x
10                   114.43s / 0.23x    3.86s / 6.91x     3.59s / 7.43x
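As a quick reading aid for Table 1, the snippet below recomputes two derived quantities from the Atomic and Wild columns: the serial run time implied by each entry (parallel time times speedup) and the parallel efficiency (speedup divided by thread count). The helper names are hypothetical and only the reported numbers are used; every entry implies a serial DCD time close to 26.7 seconds.

```python
# (threads, seconds, speedup) for the Atomic and Wild columns of Table 1.
atomic = [(2, 15.28, 1.75), (4, 8.35, 3.20), (10, 3.86, 6.91)]
wild = [(2, 14.08, 1.90), (4, 7.61, 3.50), (10, 3.59, 7.43)]

def implied_serial_time(rows):
    """Serial run time implied by each row: parallel time * speedup."""
    return [seconds * speedup for _, seconds, speedup in rows]

def parallel_efficiency(rows):
    """Speedup divided by thread count (1.0 would be perfect scaling)."""
    return [speedup / threads for threads, _, speedup in rows]

if __name__ == "__main__":
    print(implied_serial_time(atomic), implied_serial_time(wild))
    print(parallel_efficiency(atomic), parallel_efficiency(wild))
```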


Although PASSCoDe is a simple extension of DCD to a multi-core setting, there are many options in terms of locking/atomic operations for each step, and these choices lead to variations in speed and convergence properties, as we will show in this paper.

Note that the $\Delta\alpha_i$ obtained by subproblem (5) is exactly the same as (4) in Algorithm 1 if only one thread is involved. However, when there are multiple threads, the $w$ vector may not be the latest one since some other threads have not completed the writes in step 3.

PASSCoDe-Lock. To ensure $w = \sum_i \alpha_i x_i$ for the latest $\alpha$, we have to lock the following variables between step 1 and step 2:

step 1.5: lock variables in $N_i := \{w_t \mid (x_i)_t \neq 0\}$.

The locks are then released after step 3. With this locking mechanism, PASSCoDe-Lock is serializable, i.e., it generates the same solution sequence as serial DCD. Unfortunately, threads waste a lot of time waiting for locks, so PASSCoDe-Lock is very slow compared to the non-locking versions (and even slower than the serial version of DCD). See Table 1 for details.

PASSCoDe-Atomic. The above locking scheme ensures that each thread updates $\alpha_i$ based on the latest $w$ values. However, as shown in (Niu et al., 2011; Liu & Wright, 2014), the effect of using slightly stale values is usually limited in practice. Therefore, we propose the PASSCoDe-Atomic algorithm, which avoids locking all the variables in $N_i$ simultaneously. Instead, each thread just reads the current $w$ values from memory without any locking. In practice (see Section 5) we observe that the convergence speed is not significantly affected by using slightly stale values of $w$. However, to ensure that the limit point of the algorithm is still the global optimizer of (1), the equation $w^* = \sum_i \alpha_i^* x_i$ has to be maintained. Therefore, we apply the following “atomic writes” in step 3:

step 3: For each $j \in N(i)$
          Update $w_j \leftarrow w_j + \Delta\alpha_i (x_i)_j$ atomically
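The scheme above can be illustrated with a small threaded sketch. Python has no hardware atomic add, so this sketch emulates one with a per-feature lock that is held for a single write only; that stand-in, the hinge-loss closed-form update, the dict-based rows, and the names `atomic_add`, `atomic_worker`, and `run_atomic` are all assumptions made here, not the authors' C++/OpenMP code. Reads of `w` are deliberately unlocked, so they may be slightly stale, while no write to `w` is ever lost.

```python
import threading

def atomic_add(w, locks, j, value):
    """Stand-in for a hardware atomic add: the lock is held for one write only."""
    with locks[j]:
        w[j] += value

def atomic_worker(block, X, alpha, w, locks, sq_norm, C, epochs):
    for _ in range(epochs):
        for i in block:
            # Unlocked read of w: some entries may be slightly stale.
            grad = sum(w[j] * v for j, v in X[i].items()) - 1.0
            new_alpha = min(max(alpha[i] - grad / sq_norm[i], 0.0), C)
            delta = new_alpha - alpha[i]
            alpha[i] = new_alpha  # alpha_i is owned by this thread's block
            for j, v in X[i].items():
                atomic_add(w, locks, j, delta * v)  # step 3, one coordinate at a time

def run_atomic(X, d, C=1.0, threads=4, epochs=10):
    """X[i] = {feature: value} with at least one nonzero; d = number of features."""
    n = len(X)
    alpha, w = [0.0] * n, [0.0] * d
    locks = [threading.Lock() for _ in range(d)]
    sq_norm = [sum(v * v for v in x.values()) for x in X]
    blocks = [list(range(t, n, threads)) for t in range(threads)]
    pool = [threading.Thread(target=atomic_worker,
                             args=(b, X, alpha, w, locks, sq_norm, C, epochs))
            for b in blocks]
    for th in pool:
        th.start()
    for th in pool:
        th.join()
    return alpha, w
```

Since every increment to `w` is applied exactly once, the final `w` equals $\sum_i \alpha_i x_i$ up to floating-point rounding, which is the property the atomic writes are meant to preserve.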
Table 2: The performance of PASSCoDe-Wild using $\hat{w}$ or $\bar{w}$ for prediction. Results show that $\hat{w}$ yields much better prediction accuracy, which justifies our theoretical analysis in Section 4.2.

                        Prediction accuracy (%) by
Dataset    # threads    $\hat{w}$    $\bar{w}$    LIBLINEAR
news20     4            97.1         96.1         97.1
           8            97.2         93.3
covtype    4            67.8         38.0         66.3
           8            67.6         38.0
rcv1       4            97.7         97.5         97.7
           8            97.7         97.4
webspam    4            99.1         93.1         99.1
           8            99.1         88.4
kddb       4            88.8         79.7         88.8
           8            88.8         87.7


PASSCoDe-Atomic is much faster than PASSCoDe-Lock, as shown in Table 1, since an atomic write to a single variable is much faster than locking all the variables. However, the convergence of PASSCoDe-Atomic is not guaranteed by any previous convergence analysis. To bridge this gap between practice and theory, we prove linear convergence of PASSCoDe-Atomic under certain conditions in Section 4.

PASSCoDe-Wild. Finally, we consider Algorithm 2 without any locks or atomic operations. The resulting algorithm, PASSCoDe-Wild, is faster than PASSCoDe-Atomic and PASSCoDe-Lock and can achieve almost linear speedup using a single processing unit. However, due to the memory conflicts in step 3, some of the “updates” to $w$ will be overwritten by other threads. As a result, the $\hat{w}$ and $\hat{\alpha}$ output by the algorithm usually do not satisfy Eq. (3):

$$\hat{w} \neq \bar{w} := \sum_i \hat{\alpha}_i x_i, \qquad (6)$$

where $\hat{w}, \hat{\alpha}$ are the primal and dual variables output by the algorithm, and $\bar{w}$ defined in (6) is computed from $\hat{\alpha}$. It is easy to see that $\hat{\alpha}$ is not the optimal solution of (2). For the same reason, in the prediction phase it is not clear whether $\hat{w}$ or $\bar{w}$ should be used. To answer this question, in Section 4.2 we show that $\hat{w}$ is actually the optimal solution of a perturbed version of the primal problem (1) using a backward error analysis, where the loss function is the same and the regularization term is slightly perturbed. As a result, the prediction should be done using $\hat{w}$, and this also yields much better performance in practice, as shown in Table 2.
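To make the distinction between the two primal vectors concrete, the following sketch recomputes $\bar{w}$ from the dual variables, measures the gap $\epsilon = \bar{w} - \hat{w}$ left behind by lost writes, and predicts with the maintained vector $\hat{w}$. The function names and the dict-based sparse rows are hypothetical choices for illustration only.

```python
def primal_from_dual(alpha, X, d):
    """w_bar = sum_i alpha_i * x_i for sparse rows X[i] = {feature: value}."""
    w_bar = [0.0] * d
    for a, x in zip(alpha, X):
        for j, v in x.items():
            w_bar[j] += a * v
    return w_bar

def memory_conflict_error(alpha_hat, X, w_hat):
    """eps = w_bar - w_hat; it is zero exactly when relationship (3) holds."""
    w_bar = primal_from_dual(alpha_hat, X, len(w_hat))
    return [wb - wh for wb, wh in zip(w_bar, w_hat)]

def predict(w_hat, x):
    """Predicted label from the maintained primal vector w_hat (raw features x)."""
    score = sum(w_hat[j] * v for j, v in x.items())
    return 1 if score >= 0 else -1
```

For example, with `X = [{0: 1.0}, {1: 2.0}]`, `alpha_hat = [0.5, 0.25]` and a maintained `w_hat = [0.5, 0.4]` (one write partially lost), the gap is nonzero only in the second coordinate.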

We summarize the behavior of the three algorithms in Figure 1. Using locks, the algorithm PASSCoDe-Lock is serializable but very slow (even slower than serial DCD). At the other extreme, the wild version without any locks or atomic operations has very good speedup, but its behavior can be totally different from serial DCD. Luckily, in Section 4 we provide a convergence guarantee for PASSCoDe-Atomic, and apply a backward error analysis to show that PASSCoDe-Wild converges to the solution of a problem with the same loss function and a slightly perturbed regularizer.

3.3 Implementation Details 

Deadlock Avoidance. Without a proper implementation, deadlock can arise in PASSCoDe-Lock because a thread needs to acquire all the locks associated with $N_i$. A simple way to avoid deadlock is to fix an ordering of all the locks such that every thread acquires its locks in that same order.
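The ordering idea can be sketched as follows, assuming one lock per feature and ascending feature index as the global order. The hinge-loss closed-form update and the names `locked_step` and `run_lock` are assumptions of this illustration rather than the authors' implementation. Because every thread acquires locks in ascending index order, no cycle of waiting threads can form.

```python
import threading

def locked_step(i, X, alpha, w, locks, sq_norm, C):
    """One coordinate update holding the locks of every feature in N_i."""
    feats = sorted(X[i])  # ascending feature index = one global lock order
    for j in feats:
        locks[j].acquire()
    try:
        grad = sum(w[j] * X[i][j] for j in feats) - 1.0  # reads the latest w
        new_alpha = min(max(alpha[i] - grad / sq_norm[i], 0.0), C)
        delta = new_alpha - alpha[i]
        alpha[i] = new_alpha
        for j in feats:
            w[j] += delta * X[i][j]
    finally:
        for j in reversed(feats):
            locks[j].release()

def run_lock(X, d, C=1.0, threads=4, epochs=10):
    """X[i] = {feature: value} with at least one nonzero; d = number of features."""
    n = len(X)
    alpha, w = [0.0] * n, [0.0] * d
    locks = [threading.Lock() for _ in range(d)]
    sq_norm = [sum(v * v for v in x.values()) for x in X]

    def worker(block):
        for _ in range(epochs):
            for i in block:
                locked_step(i, X, alpha, w, locks, sq_norm, C)

    pool = [threading.Thread(target=worker, args=(list(range(t, n, threads)),))
            for t in range(threads)]
    for th in pool:
        th.start()
    for th in pool:
        th.join()
    return alpha, w
```

Holding all locks of $N_i$ across both the read and the write is what makes each step behave like a serial DCD step, and it is also the source of the overhead reported for the Lock variant.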
Mechanism:        Locks      Atomic Ops      Nothing
Scaling:          Poor     <------------>    Good
Serializability:  Perfect  <------------>    Poor

Figure 1: Spectrum for the choice of mechanism to avoid memory conflicts for PASSCoDe.


Random Permutation. In LIBLINEAR, the random sampling (step 2) of Algorithm 1 is replaced by the index from a random permutation, such that each $\alpha_i$ can be selected in $n$ steps instead of $n \log n$ steps in expectation. Random permutation can be easily implemented asynchronously for Algorithm 2 as follows. Initially, given $p$ threads, $\{1, \dots, n\}$ is randomly partitioned into $p$ blocks. Then, each thread can asynchronously generate a random permutation on its own block of variables.
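A minimal sketch of this partition-then-permute scheme is shown below. The function names are hypothetical and indices are 0-based; each thread would call `epoch_order` with its own random generator, so no coordination between threads is needed.

```python
import random

def partition_indices(n, p, seed=0):
    """Randomly partition {0, ..., n-1} into p blocks of near-equal size."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[t::p] for t in range(p)]

def epoch_order(block, rng):
    """Fresh random permutation of one thread's own block (the block is not modified)."""
    order = list(block)
    rng.shuffle(order)
    return order
```

Within one pass every index of a block is visited exactly once, which is the property that replaces sampling with replacement.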

Shrinking Heuristic. For losses such as hinge and squared-hinge, the optimal $\alpha^*$ is usually sparse. Based on this property, a shrinking strategy was proposed by Hsieh et al. (2008) to further speed up DCD. This heuristic is also implemented in LIBLINEAR. The idea is to maintain an active set by skipping variables which tend to be fixed. This heuristic can also be implemented in Algorithm 2 by maintaining an active set for each thread.

Thread Affinity. The memory design of most modern multi-core machines is non-uniform memory access (NUMA), where a core has faster access to the memory of its local socket. To reduce possible latency due to remote socket access, we should bind each thread to a physical core and allocate data in its local memory. Note that the current OpenMP does not support this functionality for thread affinity. Libraries such as libnuma can be used to enforce thread affinity.


4 Convergence Analysis 

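In this section we formally analyze the convergence properties of the algorithms proposed in Section 3. Note that all the proofs can be found in the Appendix. We assign a global counter $j$ for the total number of updates, and the index $i(j)$ denotes the component selected at step $j$. We define $\{\alpha^1, \alpha^2, \dots\}$ to be the sequence generated by our algorithms, and

$$\Delta\alpha_j = \alpha^{j+1}_{i(j)} - \alpha^{j}_{i(j)}.$$

The update $\Delta\alpha_j$ at iteration $j$ is obtained by solving

$$\Delta\alpha_j \leftarrow \arg\min_\delta \frac{1}{2}\|\hat{w}^j + \delta x_{i(j)}\|^2 + \ell^*_{i(j)}(-(\alpha^j_{i(j)} + \delta)),$$

where $\hat{w}^j$ is the current $w$ in the memory. We use $w^j = \sum_i \alpha_i^j x_i$ to denote the “accurate” $w$ at iteration $j$.

In the PASSCoDe-Lock setting, $w^j = \hat{w}^j$ is ensured by using the locks. However, in PASSCoDe-Atomic and PASSCoDe-Wild, $\hat{w}^j \neq w^j$ because some of the updates have not been written into the shared memory. To capture this phenomenon, we define $\mathcal{Z}^j$ to be the set of all “updates to $w$” before iteration $j$:

$$\mathcal{Z}^j := \{(t, k) \mid t < j,\ k \in N(i(t))\},$$

where $N(i(t)) := \{u \mid X_{i(t),u} \neq 0\}$ is the set of all nonzero features in $x_{i(t)}$. We define $\mathcal{U}^j \subseteq \mathcal{Z}^j$ to be the updates that have already been written into $\hat{w}^j$. Therefore, we have

$$\hat{w}^j = w^0 + \sum_{(t,k) \in \mathcal{U}^j} (\Delta\alpha_t)\, X_{i(t),k}\, e_k.$$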










4.1 Linear Convergence of PASSCoDe-Atomic 

In PASSCoDe-Atomic, we assume all the updates before the $(j - \tau)$-th iteration have been written into $\hat{w}^j$; therefore,

Assumption 1. The set $\mathcal{U}^j$ satisfies $\mathcal{Z}^{j-\tau} \subseteq \mathcal{U}^j \subseteq \mathcal{Z}^j$.


Now we define some constants used in our theoretical analysis. Note that $X \in \mathbb{R}^{n \times d}$ is the data matrix, and we use $\bar{X} \in \mathbb{R}^{n \times d}$ to denote the normalized data matrix where each row is $\bar{x}_i^T = x_i^T / \|x_i\|^2$. We then define

$$M_i = \max_{S \subseteq [d]} \Big\| \sum_{t \in S} \bar{X}_{:,t} X_{i,t} \Big\|, \qquad M = \max_i M_i,$$

where $[d] := \{1, \dots, d\}$ is the set of all the feature indices, and $\bar{X}_{:,t}$ is the $t$-th column of $\bar{X}$. We also define $L_{\max}$ to be the Lipschitz constant of $D(\cdot)$ within the level set $\{\alpha \mid D(\alpha) \le D(\alpha^0)\}$, $R_{\min} = \min_i \|x_i\|^2$, and $R_{\max} = \max_i \|x_i\|^2$. We assume that $R_{\max} = 1$ and there is no zero training sample, so $R_{\min} > 0$.

To prove the convergence of asynchronous algorithms, we first show in Lemma 1 that the expected step size does not increase super-linearly.


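Lemma 1. If $\tau$ is small enough such that

$$(6\tau(\tau + 1)^2 e M)/\sqrt{n} \le 1, \qquad (7)$$

then PASSCoDe-Atomic satisfies the following inequality:

$$E(\|\alpha^{j-1} - \alpha^{j}\|^2) \le \rho\, E(\|\alpha^{j} - \alpha^{j+1}\|^2), \qquad (8)$$

where $\rho = (1 + \ldots$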

The detailed proof is in Appendix A.2. We use a similar technique as in (Liu & Wright, 2014) to prove this lemma, but the proof is different from (Liu & Wright, 2014) because

• Their “inconsistent read” model assumes $\hat{w}^j = \sum_i \bar{\alpha}_i x_i$ for some $\bar{\alpha}$. However, in our case $\hat{w}^j$ may not be written in this form due to incomplete updates in step 3 of Algorithm 2.

• In (Liu & Wright, 2014), each coordinate is updated by $\gamma \nabla_t f(\alpha)$ with a fixed step size $\gamma$. We consider the case that each one-variable subproblem is solved exactly.

To show the linear convergence of our algorithms, we assume the objective function (2) satisfies the following property:

Definition 1. The objective function (2) admits the global error bound if there is a constant $\kappa$ such that

$$\|\alpha - P_S(\alpha)\| \le \kappa \|T(\alpha) - \alpha\|, \qquad (9)$$

where $P_S(\cdot)$ is the projection to the set of optimal solutions, and $T : \mathbb{R}^n \to \mathbb{R}^n$ is the operator defined by

$$T_t(\alpha) = \arg\min_u D(\alpha + (u - \alpha_t) e_t) \quad \forall t = 1, \dots, n.$$

The objective function satisfies the global error bound from the beginning if (9) holds for all $\alpha$ satisfying

$$D(\alpha) \le D(\alpha^0),$$

where $\alpha^0$ is the initial point.













This definition is a generalized version of Definition 6 in (Wang & Lin, 2014). We list several important machine learning problems that admit global error bounds:


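• Support Vector Machines (SVM) with hinge loss (Boser et al., 1992):

$$\ell_i(z_i) = C \max(1 - z_i, 0), \qquad \ell_i^*(-\alpha_i) = \begin{cases} -\alpha_i & \text{if } 0 \le \alpha_i \le C, \\ \infty & \text{otherwise.} \end{cases} \qquad (10)$$

• Support Vector Machines (SVM) with square hinge loss:

$$\ell_i(z_i) = C \max(1 - z_i, 0)^2, \qquad \ell_i^*(-\alpha_i) = \begin{cases} -\alpha_i + \alpha_i^2/4C & \text{if } \alpha_i \ge 0, \\ \infty & \text{otherwise.} \end{cases} \qquad (11)$$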

Note that C > 0 is the penalty parameter that controls the weights between loss and regularization. 
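As a worked illustration of how these two conjugates enter the one-variable subproblem, the closed-form coordinate updates can be derived directly. This is a sketch obtained by setting the derivative of the subproblem to zero and projecting onto the feasible interval, assuming $\|x_i\|^2 > 0$ and that $w$ denotes the maintained primal vector.

```latex
% Hinge loss: \ell_i^*(-(\alpha_i+\delta)) = -(\alpha_i+\delta) on 0 \le \alpha_i+\delta \le C.
\begin{align*}
\min_{\delta}\ & \tfrac{1}{2}\|w + \delta x_i\|^2 - (\alpha_i + \delta)
   \quad \text{s.t. } 0 \le \alpha_i + \delta \le C \\
\Rightarrow\ & \alpha_i^{\text{new}}
   = \min\!\Big(\max\!\Big(\alpha_i - \frac{w^T x_i - 1}{\|x_i\|^2},\, 0\Big),\, C\Big).
\end{align*}

% Square hinge loss: \ell_i^*(-(\alpha_i+\delta)) = -(\alpha_i+\delta) + (\alpha_i+\delta)^2/4C
% on \alpha_i+\delta \ge 0.
\begin{align*}
0 &= w^T x_i + \delta \|x_i\|^2 - 1 + \frac{\alpha_i + \delta}{2C} \\
\Rightarrow\ & \alpha_i^{\text{new}}
   = \max\!\Big(\alpha_i - \frac{w^T x_i - 1 + \alpha_i/(2C)}{\|x_i\|^2 + 1/(2C)},\, 0\Big).
\end{align*}
```

In both cases $\Delta\alpha_i = \alpha_i^{\text{new}} - \alpha_i$, and the only data-dependent work is the inner product $w^T x_i$.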

Theorem 1. The Support Vector Machines (SVM) with hinge loss or square hinge loss satisfy the global error bound (9).

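Proof. For SVM with hinge loss, each element of the mapping $T(\cdot)$ can be written as

$$T_t(\alpha) = \arg\min_u D(\alpha + (u - \alpha_t) e_t) = \arg\min_u \frac{1}{2}\|w(\alpha) + (u - \alpha_t) x_t\|^2 + \ell^*(-u) = \Pi_X\Big(\alpha_t - \frac{w(\alpha)^T x_t - 1}{\|x_t\|^2}\Big) = \Pi_X\Big(\alpha_t - \frac{\nabla_t D(\alpha)}{\|x_t\|^2}\Big),$$

where $\Pi_X$ is the projection to the set $X$, and for hinge-loss SVM $X := [0, C]$. Using Lemma 26 in (Wang & Lin, 2014), we can show that for all $t = 1, \dots, n$

$$\Big|\alpha_t - \Pi_X\Big(\alpha_t - \frac{\nabla_t D(\alpha)}{\|x_t\|^2}\Big)\Big| \ge \min\Big(1, \frac{1}{\|x_t\|^2}\Big)\, |\alpha_t - \Pi_X(\alpha_t - \nabla_t D(\alpha))| \ge |\alpha_t - \Pi_X(\alpha_t - \nabla_t D(\alpha))|,$$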
Table 3: Data statistics. $\tilde{n}$ is the number of test instances, $\bar{d}$ is the average nnz per instance.

Dataset    n             $\tilde{n}$    d             $\bar{d}$    C
news20     16,000        3,996          1,355,191     455.5        2
covtype    500,000       81,012         54            11.9         0.0625
rcv1       677,399       20,242         47,236        73.2         1
webspam    280,000       70,000         16,609,143    3727.7       1
kddb       19,264,097    748,401        29,890,095    29.4         1


where the last inequality is due to the assumption that $R_{\max} = 1$. Therefore,

$$\|\alpha - T(\alpha)\|_2 \ge \frac{1}{\sqrt{n}}\|\alpha - T(\alpha)\|_1 \ge \frac{1}{\sqrt{n}} \sum_{t=1}^n |\alpha_t - \Pi_X(\alpha_t - \nabla_t D(\alpha))| = \frac{1}{\sqrt{n}}\|\nabla^+ D(\alpha)\|_1 \ge \frac{1}{\sqrt{n}}\|\nabla^+ D(\alpha)\|_2 \ge \frac{1}{\kappa_0 \sqrt{n}}\|\alpha - P_S(\alpha)\|_2,$$

where $\nabla^+ D(\alpha)$ is the projected gradient defined in Definition 5 of (Wang & Lin, 2014) and $\kappa_0$ is the $\kappa$ defined in Theorem 18 of (Wang & Lin, 2014). Thus, with $\kappa = \kappa_0 \sqrt{n}$, we obtain that the dual function of the hinge-loss SVM satisfies the global error bound defined in Definition 1. Similarly, we can show that the SVM with squared-hinge loss satisfies the global error bound. □

Next we explicitly state the linear convergence guarantee for PASSCoDe-Atomic.

Theorem 2. Assume the objective function (2) admits a global error bound from the beginning and the Lipschitz constant $L_{\max}$ is finite in the level set. If (7) holds and


1 > 


2Lmnx, ctM 

(1 -h —^)(-) 


“ Rl 


n 


n 


then PASSCoDe-Atomic has a global linear convergence rate in expectation, that is,

$$E[D(\alpha^{j+1})] - D(\alpha^*) \le \eta\, \big(E[D(\alpha^{j})] - D(\alpha^*)\big),$$

where $\alpha^*$ is the optimal solution and


p = 1 — 


Lr 


a ‘2LjYiax 

P2 


ctM 


n 


n 


( 12 ) 


(13) 
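4.2 Backward Error Analysis for PASSCoDe-Wild

In PASSCoDe-Wild, assume the sequence $\{\alpha^j\}$ converges to $\hat{\alpha}$ and $\{\hat{w}^j\}$ converges to $\hat{w}$. Now we show that the dual solution $\hat{\alpha}$ and the corresponding primal variables $\bar{w} = \sum_{i=1}^n \hat{\alpha}_i x_i$ are actually the dual and primal solutions of a perturbed problem:

Theorem 3. $\hat{\alpha}$ is the optimal solution of a perturbed dual problem

$$\hat{\alpha} = \arg\min_\alpha D(\alpha) - \sum_{i=1}^n \alpha_i \epsilon^T x_i, \qquad (14)$$

and $\bar{w} = \sum_i \hat{\alpha}_i x_i$ is the solution of the corresponding primal problem:

$$\bar{w} = \arg\min_w \frac{1}{2} w^T w + \sum_{i=1}^n \ell_i((w - \epsilon)^T x_i), \qquad (15)$$

where $\epsilon \in \mathbb{R}^d$ is given by $\epsilon = \bar{w} - \hat{w}$.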


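Proof. By definition, $\hat{\alpha}$ is the limit point of PASSCoDe-Wild. Therefore, $\{\Delta\alpha_i\} \to 0$ for all $i$. Combining with the fact that $\{\hat{w}^j\} \to \hat{w}$, we have

$$-\hat{w}^T x_i \in \partial_{\alpha_i} \ell_i^*(-\hat{\alpha}_i), \quad \forall i.$$

Since $\hat{w} = \bar{w} - \epsilon$, we have

$$-(\bar{w} - \epsilon)^T x_i \in \partial_{\alpha_i} \ell_i^*(-\hat{\alpha}_i), \quad \forall i,$$
$$-\bar{w}^T x_i \in \partial_{\alpha_i} \big(\ell_i^*(-\hat{\alpha}_i) - \hat{\alpha}_i \epsilon^T x_i\big), \quad \forall i,$$
$$0 \in \partial_{\alpha_i} \Big(\frac{1}{2}\Big\|\sum_{i=1}^n \hat{\alpha}_i x_i\Big\|^2 + \ell_i^*(-\hat{\alpha}_i) - \hat{\alpha}_i \epsilon^T x_i\Big), \quad \forall i,$$

which is the optimality condition of (14). Thus, $\hat{\alpha}$ is the optimal solution of (14).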

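For the second part of the theorem, let’s consider the following equivalent primal problem and its Lagrangian:

$$\min_{w, \xi} \frac{1}{2} w^T w + \sum_{i=1}^n \ell_i(\xi_i) \quad \text{s.t. } \xi_i = (w - \epsilon)^T x_i \quad \forall i = 1, \dots, n,$$

$$L(w, \xi, \alpha) := \frac{1}{2} w^T w + \sum_{i=1}^n \big\{ \ell_i(\xi_i) + \alpha_i (\xi_i - (w - \epsilon)^T x_i) \big\}.$$

The corresponding convex version of the dual function can be derived as follows.

$$\hat{D}(\alpha) = \max_{w, \xi} -L(w, \xi, \alpha)$$
$$= \max_w \Big\{ -\frac{1}{2} w^T w + \sum_{i=1}^n \alpha_i (w - \epsilon)^T x_i \Big\} + \sum_{i=1}^n \max_{\xi_i} \big\{ -\ell_i(\xi_i) - \alpha_i \xi_i \big\}$$
$$= \frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i x_i\Big\|^2 - \sum_{i=1}^n \alpha_i \epsilon^T x_i + \sum_{i=1}^n \ell_i^*(-\alpha_i)$$
$$= D(\alpha) - \sum_{i=1}^n \alpha_i \epsilon^T x_i.$$

The second-to-last equality comes from 1) the substitution of $w^* = \sum_{i=1}^n \alpha_i x_i$ obtained by setting $\nabla_w (-L(w, \xi, \alpha)) = 0$; 2) the definition of the conjugate function $\ell_i^*(-\alpha_i)$. Thus, the second part of the theorem follows. □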


Note that $\epsilon$ is the error caused by the memory conflicts. From Theorem 3, $\bar{w}$ is the optimal solution of the “biased” primal problem (15); however, in (15) the actual model that fits the loss function is $\hat{w} = \bar{w} - \epsilon$. Therefore, after the training process we should use $\hat{w}$ to predict, which is the $w$ we maintained during the parallel coordinate descent updates. Changing variables from $w$ to $w - \epsilon$ in (15), we have the following corollary:


Corollary 1. $\hat{w}$ computed by PASSCoDe-Wild is the solution of the following perturbed primal problem:

$$\hat{w} = \arg\min_w \frac{1}{2}(w + \epsilon)^T (w + \epsilon) + \sum_{i=1}^n \ell_i(w^T x_i). \qquad (16)$$


The above corollary shows that the computed primal solution $\hat{w}$ is actually the exact solution of a perturbed problem (where the perturbation is on the regularizer). This strategy (of showing that the computed solution to a problem is the exact solution of a perturbed problem) is inspired by the backward error analysis technique commonly employed in numerical analysis (Wilkinson, 1961).*
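The corollary can be checked by hand in the smooth case. The short derivation below is a sketch under the added assumption that every $\ell_i$ is differentiable (for example squared hinge or logistic loss); it verifies that the gradient of the perturbed primal objective vanishes at $\hat{w}$ once the dual variables satisfy the fixed-point condition of the coordinate updates.

```latex
% Assumption: each \ell_i is differentiable.
\begin{align*}
\nabla_w \Big[ \tfrac{1}{2}(w+\epsilon)^T(w+\epsilon) + \sum_{i=1}^n \ell_i(w^T x_i) \Big]_{w=\hat w}
  &= \hat w + \epsilon + \sum_{i=1}^n \ell_i'(\hat w^T x_i)\, x_i \\
  &= \bar w + \sum_{i=1}^n \ell_i'(\hat w^T x_i)\, x_i
     && (\epsilon = \bar w - \hat w) \\
  &= \sum_{i=1}^n \big(\hat\alpha_i + \ell_i'(\hat w^T x_i)\big)\, x_i
     && (\bar w = \textstyle\sum_i \hat\alpha_i x_i) \\
  &= 0 && \text{if } \hat\alpha_i = -\ell_i'(\hat w^T x_i) \text{ for all } i.
\end{align*}
```

The condition $\hat{\alpha}_i = -\ell_i'(\hat{w}^T x_i)$ is the differentiable form of the limit-point condition $-\hat{w}^T x_i \in \partial_{\alpha_i} \ell_i^*(-\hat{\alpha}_i)$: each coordinate update computed from the in-memory vector $\hat{w}$ has become zero.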


5 Experimental Results 


We conduct several experiments and show that the proposed PASSCoDe-Atomic and PASSCoDe-Wild have superior performance compared to other state-of-the-art parallel coordinate descent algorithms. We consider the hinge loss and five datasets: news20, covtype, rcv1, webspam, and kddb. Detailed information is shown in Table 3. To have a fair comparison, we implement all compared methods in C++ using OpenMP as the parallel programming framework. All the experiments are performed on an Intel multi-core dual-socket machine with 256 GB memory. Each socket is associated with 10 computation cores. We explicitly enforce that all the threads use cores from the same socket to avoid inter-socket communication. Our codes will be publicly available. We focus on solving the (hinge loss) SVM (see (5) in the Appendix) in the experiments, but the algorithms can also be applied to other objective functions. Note that some of the figures are in Appendix 6.

Serial Baselines.

• DCD: we implement Algorithm 1. Instead of sampling with replacement, a random permutation is used to enforce random sampling without replacement.

• LIBLINEAR: we use the implementation in http://www.csie.ntu.edu.tw/~cjlin/liblinear. This implementation is equivalent to DCD with the shrinking strategy.

Compared Parallel Implementation.

• PASSCoDe: We implement the proposed three variants of Algorithm 2 using DCD as the building block: Wild, Atomic, and Lock.

• CoCoA: We implement a multi-core version of CoCoA (Jaggi et al., 2014) with $\beta_K = 1$ and DCD as its local dual method.

• AsySCD: We follow the description in (Liu & Wright, 2014; Liu et al., 2014) to implement AsySCD with the step length $\gamma = 5$ and the shuffling period $p = 10$ as suggested in (Liu et al., 2014).


*J. H. Wilkinson received the Turing Award in 1970, partly for his work on backward error analysis 
5.1 Convergence in terms of iterations.

The primal objective function value is used to determine the convergence. Note that we still use $P(\hat{w})$ for PASSCoDe-Wild, although the true primal objective should be (16). As long as $\hat{w}^T \epsilon$ remains small enough, the trends of (16) and $P(\hat{w})$ are similar.

Figures 4(a), 5(a), and 6(a) show the convergence results of PASSCoDe-Wild, PASSCoDe-Atomic, CoCoA, and AsySCD with 10 threads in terms of the number of iterations. The horizontal line in grey indicates the primal objective function value obtained by LIBLINEAR using the default stopping condition. The result for LIBLINEAR is also included for reference. We have the following observations:

• The convergence of the three PASSCoDe variants is almost identical and very close to the convergence behavior of serial LIBLINEAR on three large sparse datasets (rcv1, webspam, and kddb).

• PASSCoDe-Wild and PASSCoDe-Atomic converge significantly faster than CoCoA.

• On covtype, a denser dataset, all three algorithms (PASSCoDe-Wild, PASSCoDe-Atomic, and CoCoA) have slower convergence.


5.2 Efficiency. 

Timing. To have a fair comparison, we include both initialization and computation in the timing results. 
For DCD, PASSCoDe, and CoCoA, initialization takes one pass over the entire data matrix (which is $O(\mathrm{nnz}(X))$) 
to compute $\|x_i\|$ for each instance. In the initialization stage, AsySCD requires $O(n \times \mathrm{nnz}(X))$ time and 
$O(n^2)$ space to form and store the Hessian matrix $Q$ of the dual problem. Thus, we only have results on news20 for 
AsySCD, as all other datasets are too large for AsySCD to fit $Q$ in even 256 GB of memory. Note that we also 
parallelize the initialization part for each algorithm in our implementation to have a fair comparison. 
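The $O(n^2)$ space requirement can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that $Q$ is stored as a dense matrix of 8-byte floating-point numbers and that 256 GB means $256 \times 2^{30}$ bytes; neither detail is stated in the text, and the helper names are hypothetical.

```python
import math

BYTES_PER_DOUBLE = 8          # assumption: dense storage in double precision
GIB = 2 ** 30                 # assumption: 1 GB taken as 2**30 bytes

def dense_hessian_bytes(n):
    """Memory needed for a dense n-by-n matrix of doubles."""
    return n * n * BYTES_PER_DOUBLE

def max_instances_for(memory_bytes):
    """Largest n whose dense n-by-n Hessian fits in the given memory."""
    return math.isqrt(memory_bytes // BYTES_PER_DOUBLE)

max_n = max_instances_for(256 * GIB)
```

Under these assumptions the dense Hessian fits only when the number of instances is below roughly 185 thousand, which illustrates why a quadratic-memory method is limited to the smallest dataset.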

Figures 2(b), 3(b), 4(b), 5(b), and 6(b) show the primal objective values in terms of time, and Figures 2(c), 
3(c), 4(c), 5(c), and 6(c) show the accuracy in terms of time. Note that the x-axis for news20, covtype, and 
rcv1 is in log scale. A horizontal gray line in each figure denotes the objective value/accuracy obtained 
by LIBLINEAR using the default stopping condition. We have the following observations: 

• From Figures 4(b) and 4(c), we can see that AsySCD is orders of magnitude slower than the other 
approaches, including the parallel methods and the serial reference (AsySCD using 10 cores takes 0.4 seconds 
to run 10 iterations, while all the other parallel approaches take less than 0.14 seconds, and LIBLINEAR 
takes less than 0.3 seconds). In fact, AsySCD is still slower than the other methods even when the 
initialization time is excluded. This is expected because AsySCD is a parallel version of a standard 
coordinate descent method, which is known to be much slower than DCD for the dual problem. Since AsySCD runs 
out of memory for all the other larger datasets, we do not show the results in other figures. 

• In most figures, both PASSCoDe approaches outperform CoCoA. In Figure 6(c) (kddb), CoCoA shows better 
accuracy in the early stage, which can be explained by the ensemble nature of CoCoA. In 
the long term, it still converges to the accuracy obtained by LIBLINEAR. 

• For all datasets, PASSCoDe-Wild is shown to be slightly faster than PASSCoDe-Atomic. Given the fact 
that both methods show similar convergence in terms of iterations, this phenomenon can be explained 
by the effect of atomic operations. We can observe that the denser the dataset, the larger the difference 
between PASSCoDe-Wild and PASSCoDe-Atomic. 


5.3 Speedup 

We are interested in the following evaluation criterion: 

$$\text{speedup} := \frac{\text{time taken by the best serial reference method}}{\text{time taken by the target method with } p \text{ threads}}.$$

This criterion is different from scaling, where the numerator is replaced by “time taken by the target 
method with a single thread.” Note that a method can have perfect scaling but very poor speedup. Figures 
2(d), 3(d), 4(d), 5(d), and 6(d) show the speedup results, where 1) DCD is used as the best serial reference; 2) 
the shrinking heuristic is turned off for all PASSCoDe and DCD runs to have a fair comparison; 3) the initialization 
time is excluded from the computation of speedup. 

• PASSCoDe-Wild has very good speedup performance compared to other approaches. It achieves a speedup 
of about 6 to 8 using 10 threads on all the datasets. 

• From Figure 2(d), we can see that AsySCD does not have any “speedup” over the serial reference, 
although it is shown to have almost linear scaling (Liu et al., 2014; Liu & Wright, 2014). 
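The difference between speedup and scaling can be illustrated with a few lines of arithmetic. The sketch below takes speedup as the best serial time divided by the $p$-thread time (so larger is better, matching statements such as "6 to 8 using 10 threads") and scaling as the method's own single-thread time divided by its $p$-thread time. All timings are made-up numbers and the function names are hypothetical.

```python
def speedup(best_serial_time, parallel_time):
    """Gain over the best serial reference method."""
    return best_serial_time / parallel_time

def scaling(single_thread_time, parallel_time):
    """Gain over the same method run with one thread."""
    return single_thread_time / parallel_time

# Hypothetical timings in seconds for a slow parallel method.
best_serial = 10.0        # best serial reference
one_thread = 100.0        # the parallel method with a single thread
ten_threads = 10.0        # the parallel method with 10 threads

method_scaling = scaling(one_thread, ten_threads)    # 10.0: perfect scaling on 10 threads
method_speedup = speedup(best_serial, ten_threads)   # 1.0: no gain over the serial reference
```

A method whose single-thread run is much slower than the best serial solver can therefore scale linearly and still show no speedup, which is the pattern reported for AsySCD.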


6 Conclusions 

In this paper, we present a family of parallel asynchronous stochastic dual coordinate descent algorithms 
in the shared memory multi-core setting, where each thread repeatedly selects a random dual variable and 
conducts coordinate updates using the primal variables that are stored in the shared memory. We analyze 
the convergence properties when different locking/atomic mechanisms are used. For the setting with atomic 
updates, we show linear convergence under certain conditions. For the setting without any lock or atomic 
write, which achieves the best speedup, we present a backward error analysis to show that the primal 
variables obtained by the algorithm are the exact solution of a primal problem with a perturbed regularizer. 
Experimental results show that our algorithms are much faster than previous parallel coordinate descent 
solvers. 

Figure 2: news20 dataset 


Figure 3: covtype dataset 


Figure 4: rcv1 dataset 

Figure 5: webspam dataset 


Figure 6: kddb dataset 

A Linear Convergence for PASSCoDe-Atomic 

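A.1 Notations and Propositions 

A.1.1 Notations 

• For all $i = 1, \ldots, n$, we have the following definitions: 

$$h_i(u) := \frac{\ell_i^*(-u)}{\|x_i\|^2}$$

$$\mathrm{prox}_i(s) := \arg\min_u \frac{1}{2}(u - s)^2 + h_i(u)$$

$$T_i(w, s) := \arg\min_u \frac{1}{2}\|w + (u - s)x_i\|^2 + \ell_i^*(-u) = \arg\min_u \frac{1}{2}\left[u - \left(s - \frac{w^T x_i}{\|x_i\|^2}\right)\right]^2 + h_i(u),$$

where $w \in \mathbb{R}^d$ and $s \in \mathbb{R}$. We also denote by $\mathrm{prox}(s)$ the proximal operator from $\mathbb{R}^n$ to $\mathbb{R}^n$ such that 
$(\mathrm{prox}(s))_i = \mathrm{prox}_i(s_i)$. The operator $T_i$ is connected to the proximal operator by 

$$T_i(w, s) = \mathrm{prox}_i\left(s - \frac{w^T x_i}{\|x_i\|^2}\right).$$

• Let $\{\alpha^j\}$ and $\{\hat{w}^j\}$ be the sequences generated/maintained by Algorithm 2 using 

$$\alpha_t^{j+1} = \begin{cases} T_t(\hat{w}^j, \alpha_t^j) & \text{if } t = i(j), \\ \alpha_t^j & \text{if } t \neq i(j), \end{cases}$$

where $i(j)$ is the index selected at the $j$-th iteration. For convenience, we define 

$$\Delta\alpha_j = \alpha_{i(j)}^{j+1} - \alpha_{i(j)}^j.$$

• Let $\{\tilde{\alpha}^j\}$ be the sequence defined by 

$$\tilde{\alpha}_t^{j+1} = T_t(\hat{w}^j, \alpha_t^j) \quad \forall t = 1, \ldots, n.$$

Note that $\tilde{\alpha}_{i(j)}^{j+1} = \alpha_{i(j)}^{j+1}$ and $\tilde{\alpha}^{j+1} = \mathrm{prox}(\alpha^j - \bar{X}\hat{w}^j)$, where $\bar{X}$ is the normalized data matrix whose $t$-th row is $x_t^T / \|x_t\|^2$. 

• Let $\bar{w}^j = \sum_i \alpha_i^j x_i$ be the “true” primal variables corresponding to $\alpha^j$. 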


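A.1.2 Propositions 

Proposition 1. 

$$E_{i(j)}\left(\|\alpha^{j+1} - \alpha^j\|^2\right) = \frac{1}{n}\|\tilde{\alpha}^{j+1} - \alpha^j\|^2. \tag{17}$$

Proof. It can be proved by the definition of $\tilde{\alpha}$ and the assumption that $i(j)$ is selected uniformly at random 
from $\{1, \ldots, n\}$. □ 

Proposition 2. 

$$\|\bar{X}\bar{w}^j - \bar{X}\hat{w}^j\| \le M \sum_{t=j-\tau}^{j-1} |\Delta\alpha_t|. \tag{18}$$

Proof. 

$$\begin{aligned}
\|\bar{X}\bar{w}^j - \bar{X}\hat{w}^j\| &= \Big\|\bar{X}\Big(\sum_{(t,k) \in Z^j \setminus U^j} (\Delta\alpha_t) X_{i(t),k}\, e_k\Big)\Big\| = \Big\|\sum_{(t,k) \in Z^j \setminus U^j} (\Delta\alpha_t)\, \bar{X}_{:,k} X_{i(t),k}\Big\| \\
&\le \sum_{t=j-\tau}^{j-1} |\Delta\alpha_t|\, M_{i(t)} \le M \sum_{t=j-\tau}^{j-1} |\Delta\alpha_t|.
\end{aligned}$$

□ 

Proposition 3. For any $w_1, w_2 \in \mathbb{R}^d$ and $s_1, s_2 \in \mathbb{R}$, 

$$|T_i(w_1, s_1) - T_i(w_2, s_2)| \le \left|s_1 - s_2 - \frac{(w_1 - w_2)^T x_i}{\|x_i\|^2}\right|. \tag{19}$$

Proof. It can be proved by the connection between $T_i(w, s)$ and $\mathrm{prox}_i(\cdot)$ and the non-expansiveness of the proximal 
operator. □ 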
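The non-expansiveness argument can be checked numerically for one concrete loss. The sketch below assumes the hinge loss with the label absorbed into $x_i$, for which $\ell_i^*(-u) = -u$ on $[0, C]$ and therefore $T_i(w, s) = \min(\max(s - (w^T x_i - 1)/\|x_i\|^2, 0), C)$. It samples random inputs and records the largest value of $|T_i(w_1, s_1) - T_i(w_2, s_2)|$ minus $|s_1 - s_2 - (w_1 - w_2)^T x_i / \|x_i\|^2|$, the bound that follows from $T_i(w, s) = \mathrm{prox}_i(s - w^T x_i/\|x_i\|^2)$. The function names and sampling ranges are illustrative choices.

```python
import random

def T_hinge(w, s, x, C=1.0):
    """T_i(w, s) for the hinge loss: a clipped one-dimensional update."""
    sq = sum(v * v for v in x)
    wx = sum(a * b for a, b in zip(w, x))
    return min(max(s - (wx - 1.0) / sq, 0.0), C)

def worst_gap(trials=10000, d=5, seed=0):
    """Largest observed (left-hand side - right-hand side) of the bound."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        w1 = [rng.uniform(-2.0, 2.0) for _ in range(d)]
        w2 = [rng.uniform(-2.0, 2.0) for _ in range(d)]
        s1, s2 = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
        sq = sum(v * v for v in x)
        lhs = abs(T_hinge(w1, s1, x) - T_hinge(w2, s2, x))
        rhs = abs(s1 - s2 - sum((a - b) * v for a, b, v in zip(w1, w2, x)) / sq)
        worst = max(worst, lhs - rhs)
    return worst

gap = worst_gap()   # should not exceed zero beyond floating-point rounding
```

This is a sanity check for a single loss function, not a proof; the general statement follows from the non-expansiveness of proximal operators.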

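Proposition 4. Let $M \ge 1$, $q = \frac{6(\tau+1)eM}{\sqrt{n}}$, $\rho = (1 + q)^2$, and $\theta = \sum_{t=1}^{\tau} \rho^{t/2}$. If $q(\tau + 1) \le 1$, 
then $\rho^{(\tau+1)/2} \le e$, and 

$$\rho^{-1} \le 1 - \frac{4 + 4M + 4M\theta}{\sqrt{n}}. \tag{20}$$

Proof. By the definition of $\rho$ and the condition $q(\tau + 1) \le 1$, we have 

$$\rho^{(\tau+1)/2} = \left(\left(\rho^{1/2}\right)^{1/q}\right)^{q(\tau+1)} = \left((1 + q)^{1/q}\right)^{q(\tau+1)} \le e^{q(\tau+1)} \le e.$$

By the definition of $q$, we know that 

$$q = \rho^{1/2} - 1 = \frac{6(\tau+1)eM}{\sqrt{n}} \quad\Rightarrow\quad \frac{3}{2} = \frac{\sqrt{n}\,(\rho^{1/2} - 1)}{4(\tau+1)eM}.$$

We can derive 

$$\begin{aligned}
\frac{3}{2} &= \frac{\sqrt{n}\,(\rho^{1/2} - 1)}{4(\tau+1)eM} \\
&\le \frac{\sqrt{n}\,(\rho^{1/2} - 1)}{4(\tau+1)\rho^{(\tau+1)/2}M} && \because\ \rho^{(\tau+1)/2} \le e \\
&\le \frac{\sqrt{n}\,(\rho^{1/2} - 1)}{4(1+\theta)\rho^{1/2}M} && \because\ 1 + \theta = \sum_{t=0}^{\tau}\rho^{t/2} \le (\tau+1)\rho^{\tau/2} \\
&= \frac{\sqrt{n}\,(1 - \rho^{-1/2})}{4(1+\theta)M} \\
&\le \frac{\sqrt{n}\,(1 - \rho^{-1})}{4(1+\theta)M} && \because\ \rho^{-1/2} \le 1.
\end{aligned}$$

Combining this with the condition $M \ge 1$ and the fact that $1 + \theta \ge 2$ for $\tau \ge 1$ (since then $\theta \ge \rho^{1/2} \ge 1$), so that $\frac{4}{4(1+\theta)M} \le \frac{1}{2}$, we have 

$$\frac{\sqrt{n}\,(1 - \rho^{-1}) - 4}{4(1+\theta)M} \ge \frac{\sqrt{n}\,(1 - \rho^{-1})}{4(1+\theta)M} - \frac{1}{2} \ge 1,$$

which leads to 

$$4(1+\theta)M \le \sqrt{n} - \sqrt{n}\,\rho^{-1} - 4 \quad\Rightarrow\quad \rho^{-1} \le 1 - \frac{4 + 4M + 4M\theta}{\sqrt{n}}.$$

□ 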
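The two conclusions of Proposition 4 can be checked numerically for particular parameter values. The sketch below evaluates $q = 6(\tau+1)eM/\sqrt{n}$, $\rho = (1+q)^2$, and $\theta = \sum_{t=1}^{\tau}\rho^{t/2}$, and then tests $\rho^{(\tau+1)/2} \le e$ and $\rho^{-1} \le 1 - (4 + 4M + 4M\theta)/\sqrt{n}$. The function name and the parameter values are hypothetical choices for illustration, not settings from the experiments.

```python
import math

def check_proposition_4(M, tau, n):
    """Return (q, rho, theta, first_holds, second_holds) for given M, tau, n."""
    q = 6.0 * (tau + 1) * math.e * M / math.sqrt(n)
    if M < 1 or q * (tau + 1) > 1:
        raise ValueError("need M >= 1 and q * (tau + 1) <= 1")
    rho = (1.0 + q) ** 2
    theta = sum(rho ** (t / 2.0) for t in range(1, tau + 1))
    first_holds = rho ** ((tau + 1) / 2.0) <= math.e
    second_holds = 1.0 / rho <= 1.0 - (4 + 4 * M + 4 * M * theta) / math.sqrt(n)
    return q, rho, theta, first_holds, second_holds

small_delay = check_proposition_4(M=1.0, tau=2, n=10 ** 6)
large_delay = check_proposition_4(M=5.0, tau=10, n=10 ** 8)
```

The condition $q(\tau+1) \le 1$ forces $n$ to grow roughly like $(\tau+1)^4 M^2$, which is why the second example needs a much larger $n$ for a delay of 10.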


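A.2 Proof of Lemma 1 

Similar to Liu & Wright (2014), we prove Eq. (8) by induction. First, we know that for any two vectors $a$ 
and $b$, we have 

$$\|a\|^2 - \|b\|^2 \le 2\|a\|\|b - a\|.$$

See Liu & Wright (2014) for a proof of the above inequality. Thus, for all $j$, we have 

$$\|\alpha^{j-1} - \tilde{\alpha}^j\|^2 - \|\alpha^j - \tilde{\alpha}^{j+1}\|^2 \le 2\|\alpha^{j-1} - \tilde{\alpha}^j\|\,\|\alpha^j - \tilde{\alpha}^{j+1} - \alpha^{j-1} + \tilde{\alpha}^j\|. \tag{21}$$

The second factor on the r.h.s. of (21) is bounded as follows: 

$$\begin{aligned}
&\|\alpha^j - \tilde{\alpha}^{j+1} - \alpha^{j-1} + \tilde{\alpha}^j\| \\
&\le \|\alpha^j - \alpha^{j-1}\| + \|\mathrm{prox}(\alpha^j - \bar{X}\hat{w}^j) - \mathrm{prox}(\alpha^{j-1} - \bar{X}\hat{w}^{j-1})\| \\
&\le \|\alpha^j - \alpha^{j-1}\| + \|(\alpha^j - \bar{X}\hat{w}^j) - (\alpha^{j-1} - \bar{X}\hat{w}^{j-1})\| \\
&\le \|\alpha^j - \alpha^{j-1}\| + \|\alpha^j - \alpha^{j-1}\| + \|\bar{X}\hat{w}^j - \bar{X}\hat{w}^{j-1}\| \\
&= 2\|\alpha^j - \alpha^{j-1}\| + \|\bar{X}\hat{w}^j - \bar{X}\hat{w}^{j-1}\| \\
&= 2\|\alpha^j - \alpha^{j-1}\| + \|\bar{X}\hat{w}^j - \bar{X}\bar{w}^j + \bar{X}\bar{w}^j - \bar{X}\bar{w}^{j-1} + \bar{X}\bar{w}^{j-1} - \bar{X}\hat{w}^{j-1}\| \\
&\le 2\|\alpha^j - \alpha^{j-1}\| + \|\bar{X}\hat{w}^j - \bar{X}\bar{w}^j\| + \|\bar{X}\bar{w}^j - \bar{X}\bar{w}^{j-1}\| + \|\bar{X}\bar{w}^{j-1} - \bar{X}\hat{w}^{j-1}\| \\
&\le (2 + M)\|\alpha^j - \alpha^{j-1}\| + \sum_{t=j-\tau}^{j-1} \|\Delta\alpha_t\| M + \sum_{t=j-\tau-1}^{j-2} \|\Delta\alpha_t\| M \\
&\le (2 + 2M)\|\alpha^j - \alpha^{j-1}\| + 2M \sum_{t=j-\tau-1}^{j-2} \|\Delta\alpha_t\|.
\end{aligned} \tag{22}$$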

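Now we prove (8) by induction. 

Induction Hypothesis. Due to Proposition 1, we prove the following equivalent statement. For all $j$, 

$$E\left(\|\alpha^{j-1} - \tilde{\alpha}^j\|^2\right) \le \rho\, E\left(\|\alpha^j - \tilde{\alpha}^{j+1}\|^2\right). \tag{23}$$

Induction Basis. When $j = 1$, 

$$\|\alpha^1 - \tilde{\alpha}^2 - \alpha^0 + \tilde{\alpha}^1\| \le (2 + 2M)\|\alpha^1 - \alpha^0\|.$$

By taking the expectation on (21), we have 

$$\begin{aligned}
E[\|\alpha^0 - \tilde{\alpha}^1\|^2] - E[\|\alpha^1 - \tilde{\alpha}^2\|^2] &\le 2E[\|\alpha^0 - \tilde{\alpha}^1\|\,\|\alpha^1 - \tilde{\alpha}^2 - \alpha^0 + \tilde{\alpha}^1\|] \\
&\le (4 + 4M)\, E(\|\alpha^0 - \tilde{\alpha}^1\|\,\|\alpha^0 - \alpha^1\|).
\end{aligned}$$

From (17) we have $E[\|\alpha^0 - \alpha^1\|^2] = \frac{1}{n}\|\alpha^0 - \tilde{\alpha}^1\|^2$. Also, by the AM-GM inequality, for any $\mu_1, \mu_2 > 0$ and 
any $c > 0$, we have 

$$\mu_1\mu_2 \le \frac{1}{2}\left(c\mu_1^2 + c^{-1}\mu_2^2\right). \tag{24}$$

Therefore, we have 

$$\begin{aligned}
E[\|\alpha^0 - \tilde{\alpha}^1\|\,\|\alpha^0 - \alpha^1\|] &\le \frac{1}{2}E\left[n^{1/2}\|\alpha^0 - \alpha^1\|^2 + n^{-1/2}\|\tilde{\alpha}^1 - \alpha^0\|^2\right] \\
&= \frac{1}{2}E\left[n^{-1/2}\|\alpha^0 - \tilde{\alpha}^1\|^2 + n^{-1/2}\|\tilde{\alpha}^1 - \alpha^0\|^2\right] && \text{by (17)} \\
&= n^{-1/2}E[\|\alpha^0 - \tilde{\alpha}^1\|^2].
\end{aligned}$$

Therefore, 

$$E[\|\alpha^0 - \tilde{\alpha}^1\|^2] - E[\|\alpha^1 - \tilde{\alpha}^2\|^2] \le \frac{4 + 4M}{\sqrt{n}}E[\|\alpha^0 - \tilde{\alpha}^1\|^2],$$

which implies 

$$E[\|\alpha^0 - \tilde{\alpha}^1\|^2] \le \frac{1}{1 - \frac{4+4M}{\sqrt{n}}}E[\|\alpha^1 - \tilde{\alpha}^2\|^2] \le \rho\, E[\|\alpha^1 - \tilde{\alpha}^2\|^2], \tag{25}$$

where the last inequality is based on Proposition 4 and the fact $\theta M \ge 1$. 

Induction Step. By the induction hypothesis, we assume 

$$E[\|\alpha^{t-1} - \tilde{\alpha}^t\|^2] \le \rho\, E[\|\alpha^t - \tilde{\alpha}^{t+1}\|^2] \quad \forall t \le j - 1. \tag{26}$$

The goal is to show 

$$E[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2] \le \rho\, E[\|\alpha^j - \tilde{\alpha}^{j+1}\|^2].$$

First, we show that for all $t < j$, 

$$E\left[\|\alpha^t - \alpha^{t+1}\|\,\|\tilde{\alpha}^j - \alpha^{j-1}\|\right] \le \frac{\rho^{(j-1-t)/2}}{\sqrt{n}}E\left[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2\right]. \tag{27}$$

Proof. By (24) with $c = n^{1/2}\beta$, where $\beta = \rho^{(t+1-j)/2}$, 

$$\begin{aligned}
E\left[\|\alpha^t - \alpha^{t+1}\|\,\|\tilde{\alpha}^j - \alpha^{j-1}\|\right] &\le \frac{1}{2}E\left[n^{1/2}\beta\|\alpha^t - \alpha^{t+1}\|^2 + n^{-1/2}\beta^{-1}\|\tilde{\alpha}^j - \alpha^{j-1}\|^2\right] \\
&= \frac{1}{2}E\left[n^{1/2}\beta\, E[\|\alpha^t - \alpha^{t+1}\|^2] + n^{-1/2}\beta^{-1}\|\tilde{\alpha}^j - \alpha^{j-1}\|^2\right] \\
&= \frac{1}{2}E\left[n^{-1/2}\beta\|\alpha^t - \tilde{\alpha}^{t+1}\|^2 + n^{-1/2}\beta^{-1}\|\tilde{\alpha}^j - \alpha^{j-1}\|^2\right] && \text{by Proposition 1} \\
&\le \frac{1}{2}E\left[n^{-1/2}\beta\rho^{j-1-t}\|\alpha^{j-1} - \tilde{\alpha}^j\|^2 + n^{-1/2}\beta^{-1}\|\tilde{\alpha}^j - \alpha^{j-1}\|^2\right] && \text{by Eq. (26)} \\
&\le \frac{1}{2}E\left[n^{-1/2}\beta^{-1}\|\alpha^{j-1} - \tilde{\alpha}^j\|^2 + n^{-1/2}\beta^{-1}\|\tilde{\alpha}^j - \alpha^{j-1}\|^2\right] && \text{by the definition of } \beta \\
&\le \frac{\rho^{(j-1-t)/2}}{\sqrt{n}}E\left[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2\right].
\end{aligned}$$

□ 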

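Let $\theta = \sum_{t=1}^{\tau}\rho^{t/2}$. We have 

$$\begin{aligned}
&E[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2] - E[\|\alpha^j - \tilde{\alpha}^{j+1}\|^2] \\
&\le E\Big[2\|\alpha^{j-1} - \tilde{\alpha}^j\|\Big((2 + 2M)\|\alpha^j - \alpha^{j-1}\| + 2M\sum_{t=j-\tau-1}^{j-2}\|\Delta\alpha_t\|\Big)\Big] && \text{by (21), (22)} \\
&= (4 + 4M)E\left(\|\alpha^{j-1} - \tilde{\alpha}^j\|\,\|\alpha^j - \alpha^{j-1}\|\right) + 4M\sum_{t=j-\tau-1}^{j-2}E\left[\|\alpha^{j-1} - \tilde{\alpha}^j\|\,\|\alpha^{t+1} - \alpha^t\|\right] \\
&\le (4 + 4M)n^{-1/2}E[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2] + 4Mn^{-1/2}E[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2]\sum_{t=j-1-\tau}^{j-2}\rho^{(j-1-t)/2} && \text{by (27)} \\
&\le (4 + 4M)n^{-1/2}E[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2] + 4Mn^{-1/2}\theta\, E[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2] \\
&= \frac{4 + 4M + 4M\theta}{\sqrt{n}}E[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2],
\end{aligned}$$

which implies that 

$$E[\|\alpha^{j-1} - \tilde{\alpha}^j\|^2] \le \frac{1}{1 - \frac{4 + 4M + 4M\theta}{\sqrt{n}}}E[\|\alpha^j - \tilde{\alpha}^{j+1}\|^2] \le \rho\, E[\|\alpha^j - \tilde{\alpha}^{j+1}\|^2],$$

where the last inequality is based on Proposition 4. 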


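A.3 Proof of Theorem 2 

First, we define $T(w, \alpha)$ to be an $n$-dimensional vector such that 

$$(T(w, \alpha))_t = T_t(w, \alpha_t) \quad \text{for all } t.$$

We can then bound the distance $E[\|T(\hat{w}^j, \alpha^j) - T(\bar{w}^j, \alpha^j)\|^2]$ by (we omit the expectation in the following 
derivation): 

$$\begin{aligned}
\|T(\hat{w}^j, \alpha^j) - T(\bar{w}^j, \alpha^j)\|^2 &= \sum_{t=1}^{n}\left(T_t(\hat{w}^j, \alpha_t^j) - T_t(\bar{w}^j, \alpha_t^j)\right)^2 \\
&\le \sum_{t=1}^{n}\left(\frac{(\hat{w}^j - \bar{w}^j)^T x_t}{\|x_t\|^2}\right)^2 && \text{(By Proposition 3)} \\
&= \|\bar{X}(\hat{w}^j - \bar{w}^j)\|^2 \\
&\le M^2\Big(\sum_{t=j-\tau}^{j-1}\|\alpha^{t+1} - \alpha^t\|\Big)^2 && \text{(By Proposition 2)} \\
&\le \tau M^2\Big(\sum_{t=j-\tau}^{j-1}\|\alpha^{t+1} - \alpha^t\|^2\Big) \\
&\le \tau M^2\Big(\sum_{t=1}^{\tau}\rho^t\|\alpha^j - \alpha^{j+1}\|^2\Big) && \text{(By Lemma 1)} \\
&\le \frac{\tau M^2}{n}\Big(\sum_{t=1}^{\tau}\rho^t\Big)\|T(\hat{w}^j, \alpha^j) - \alpha^j\|^2 \\
&\le \frac{\tau^2 M^2}{n}\rho^{\tau}\|T(\hat{w}^j, \alpha^j) - \alpha^j\|^2.
\end{aligned}$$

Since $\rho^{(\tau+1)/2} \le e$, we have $\rho^{\tau+1} \le e^2$, so $\rho^{\tau} \le e^2$ since $\rho \ge 1$. Therefore, 

$$\|T(\hat{w}^j, \alpha^j) - T(\bar{w}^j, \alpha^j)\|^2 \le \frac{\tau^2 M^2 e^2}{n}\|T(\hat{w}^j, \alpha^j) - \alpha^j\|^2. \tag{28}$$

As a result, 

$$\begin{aligned}
\|T(\bar{w}^j, \alpha^j) - \alpha^j\|^2 &= \|T(\bar{w}^j, \alpha^j) - T(\hat{w}^j, \alpha^j) + T(\hat{w}^j, \alpha^j) - \alpha^j\|^2 \\
&\le 2\left(\|T(\bar{w}^j, \alpha^j) - T(\hat{w}^j, \alpha^j)\|^2 + \|T(\hat{w}^j, \alpha^j) - \alpha^j\|^2\right) \\
&\le 2\left(1 + \frac{e^2\tau^2 M^2}{n}\right)\|T(\hat{w}^j, \alpha^j) - \alpha^j\|^2.
\end{aligned} \tag{29}$$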

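Next, we bound the decrease of the objective function value by 

$$\begin{aligned}
D(\alpha^j) - D(\alpha^{j+1}) &= \left(D(\alpha^j) - D(\bar{\alpha}^{j+1})\right) + \left(D(\bar{\alpha}^{j+1}) - D(\alpha^{j+1})\right) \\
&\ge \frac{\|x_{i(j)}\|^2}{2}\left(T_{i(j)}(\bar{w}^j, \alpha^j_{i(j)}) - \alpha^j_{i(j)}\right)^2 - \frac{L_{\max}}{2}\left(T_{i(j)}(\bar{w}^j, \alpha^j_{i(j)}) - T_{i(j)}(\hat{w}^j, \alpha^j_{i(j)})\right)^2,
\end{aligned}$$

where $\bar{\alpha}^{j+1}$ denotes $\alpha^j$ with coordinate $i(j)$ updated using the true primal vector $\bar{w}^j$, i.e., $\bar{\alpha}^{j+1}_{i(j)} = T_{i(j)}(\bar{w}^j, \alpha^j_{i(j)})$. So 

$$\begin{aligned}
E[D(\alpha^j)] - E[D(\alpha^{j+1})] &\ge \frac{R_{\min}^2}{2n}E[\|T(\bar{w}^j, \alpha^j) - \alpha^j\|^2] - \frac{L_{\max}}{2n}E[\|T(\bar{w}^j, \alpha^j) - T(\hat{w}^j, \alpha^j)\|^2] \\
&\ge \frac{R_{\min}^2}{2n}E[\|T(\bar{w}^j, \alpha^j) - \alpha^j\|^2] - \frac{L_{\max}}{2n}\frac{\tau^2 M^2 e^2}{n}E[\|T(\hat{w}^j, \alpha^j) - \alpha^j\|^2] \\
&\ge \frac{R_{\min}^2}{2n}E[\|T(\bar{w}^j, \alpha^j) - \alpha^j\|^2] - \frac{2L_{\max}\tau^2 M^2 e^2}{2n \cdot n}\left(1 + \frac{e\tau M}{\sqrt{n}}\right)E[\|T(\bar{w}^j, \alpha^j) - \alpha^j\|^2] \\
&\ge \frac{R_{\min}^2}{2n}\left(1 - \frac{2L_{\max}}{R_{\min}^2}\left(1 + \frac{e\tau M}{\sqrt{n}}\right)\left(\frac{\tau^2 M^2 e^2}{n}\right)\right)E[\|T(\bar{w}^j, \alpha^j) - \alpha^j\|^2].
\end{aligned}$$

Let $b = 1 - \frac{2L_{\max}}{R_{\min}^2}\left(1 + \frac{e\tau M}{\sqrt{n}}\right)\left(\frac{\tau^2 M^2 e^2}{n}\right)$ and combine the above inequality with the global error bound; we have 

$$E[D(\alpha^j)] - E[D(\alpha^{j+1})] \ge b\kappa\, E[\|\alpha^j - P_S(\alpha^j)\|^2] \ge \frac{b\kappa}{L_{\max}}E[D(\alpha^j) - D^*].$$

Therefore, we have 

$$\begin{aligned}
E[D(\alpha^{j+1})] - D^* &= E[D(\alpha^j)] - \left(E[D(\alpha^j)] - E[D(\alpha^{j+1})]\right) - D^* \\
&\le \left(1 - \frac{b\kappa}{L_{\max}}\right)\left(E[D(\alpha^j)] - D^*\right).
\end{aligned}$$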